\(R^4H_2O\)

Level Two

Peter Prevos

17 March 2022

Program

Day 1:

  1. Analysing customer experience
  2. Data cleaning
  3. Survey reliability
  4. Survey validity

Day 2:

  1. Customer segmentation
  2. Cluster analysis
  3. Linear regression

Course Book

lucidmanager.org/r4h2o

Course Project

github.com/pprevos/r4h2o

  • RStudio Cloud: New project from GitHub repository
  • RStudio Desktop: Download or clone project files

Introductions

Customer Experience

Water Utility marketing

Gruen Transfer (2009), Season 2, Episode 3.

From STEM to STEAM

Water management applies the physical sciences.

Marketing applies the social sciences.

Customer Surveys

Case Study 2 survey

  1. Consent and screening
  2. Consumer Involvement (Ten-item semantic differential scale)
  3. Contact frequency (1-item Likert scale)
  4. Perceived hardship (1-item Likert scale)
  5. Service Quality (18-item Likert Scale)
  6. Trap question (1-item Likert scale)

Consumer Involvement

  • Importance of the product to a consumer
    • Cognitive (rational)
    • Affective (emotional)

Personal Involvement Inventory

Zaichkowsky, J. L. (1994). The personal involvement inventory: revision, and application to advertising. Journal of Advertising, 23(4), 59.

  • Which items are reversed polarity?
  • Which items are cognitive and which items are affective?

Service Quality

ServAqua Development

ServAqua

Case Study 2

  1. Open RStudio in the R4H2O project
  2. Activate the Tidyverse libraries
  3. Load and explore the data:
    • casestudy2/customer_survey.csv

Cleaning Data

Open chapter-08.R script

Data Structures

Joining Data

vignette("two-table")

Tidyverse Data Cleaning

dplyr.tidyverse.org

vignette("dplyr")

tidyr.tidyverse.org

vignette("tidy-data")

Tidy Data

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Pivoting Data

pivot_longer(data, 2:3, names_to = "year", values_to = "cases")

pivot_wider(data, names_from = year, values_from = cases)

vignette("pivot")
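The two calls above can be tried on a small table. This is a minimal sketch that assumes the tidyr package is installed; the `wide` data frame and its town and year values are hypothetical and only illustrate the shape of the data.

```r
library(tidyr)

# Hypothetical wide table: one row per town, one column per year
wide <- data.frame(town = c("A", "B"),
                   `2020` = c(10, 20),
                   `2021` = c(15, 25),
                   check.names = FALSE,
                   stringsAsFactors = FALSE)

# Wide to long: columns 2 and 3 become year / cases pairs
long <- pivot_longer(wide, 2:3, names_to = "year", values_to = "cases")

# Long back to wide: one column per year again
back <- pivot_wider(long, names_from = year, values_from = cases)
```

The long version has one row per town and year, which matches the tidy data rule that each observation forms a row.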

Missing Data

  • Missing Completely at Random (statistical error)
  • Missing Not at Random (remove)
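Before deciding how to treat missing responses, it helps to count them. The sketch below uses base R only; the `survey` data frame and its item names are made up for illustration and are not the case study data.

```r
# Hypothetical survey responses with some missing answers
survey <- data.frame(id = 1:5,
                     q1 = c(4, NA, 5, 3, NA),
                     q2 = c(2, 3, NA, 4, 5))

# Number of missing values per column
missing_per_item <- colSums(is.na(survey))

# Keep only respondents who answered every item
complete <- survey[complete.cases(survey), ]
```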

Measuring Mental States

Personal Involvement Inventory

Reliability and Validity

Survey Reliability

  • Correlations
  • Cronbach’s Alpha
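Cronbach's Alpha can be computed by hand from the item variances and the variance of the summed scale, using the standard formula \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_T^2}\right)\). The sketch below is one way to do this in base R; the helper `cronbach_alpha()` and the three-item `items` data frame are hypothetical and not part of the course project.

```r
# Hypothetical responses: five respondents, three Likert items
items <- data.frame(q1 = c(1, 2, 3, 4, 5),
                    q2 = c(2, 2, 3, 5, 5),
                    q3 = c(1, 3, 3, 4, 4))

# Illustrative helper (an assumption, not a course function)
cronbach_alpha <- function(df) {
  k <- ncol(df)                        # number of items
  item_var <- sapply(df, var)          # variance of each item
  total_var <- var(rowSums(df))        # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

alpha <- cronbach_alpha(items)
```

Values close to 1 indicate that the items move together. Dedicated packages offer more complete implementations with confidence intervals and item diagnostics.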

Covariance and Correlation

\[cov(x, y) = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{n-1}\]

\[cor(x,y) = \frac{cov(x, y)}{s_x s_y}\]

\[s_x = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}\]
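The formulas can be checked against R's built-in functions. This sketch computes the covariance, both standard deviations, and the correlation (the covariance divided by the product of the two standard deviations) step by step; the vectors `x` and `y` are arbitrary illustrative values.

```r
# Arbitrary example vectors
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Covariance: sum of cross-products of deviations over n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample standard deviations
s_x <- sqrt(sum((x - mean(x))^2) / (n - 1))
s_y <- sqrt(sum((y - mean(y))^2) / (n - 1))

# Correlation: covariance divided by both standard deviations
cor_xy <- cov_xy / (s_x * s_y)

# Compare with the built-in functions
all.equal(cov_xy, cov(x, y))
all.equal(cor_xy, cor(x, y))
```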

Survey Validity

  • Face validity: Do the survey questions at face-value relate to the mental state?
  • Content validity: Does the survey instrument capture all relevant components of the latent variable?
  • Construct validity: How much variance does the model describe?
  • Discriminant validity: How different is the scale from other scales?

Exploratory Factor Analysis

Example: Trust Scale

Questions

  • How much can you count on … ?
  • How much do you trust … ?
  • How dependable is … ?

Answer Model

7-point Likert scale

Source

Bruner, G. (2012). Marketing Scales Handbook. A Compilation of Multi-item Measures for Consumer Behavior & Advertising Research. GCBII Productions.

Reliability

  • Alpha = 0.92

Validity

Factor analysis supports single-dimensionality and discriminant validity.

Customer Segmentation

Hierarchical Cluster Analysis

  1. Pre-process the data
  2. Scale the data
  3. Calculate the distances
  4. Cluster the data
  5. Review the outcome

Scaling

  • Wide data frame
  • Columns: Features
  • Rows: Cluster variable

\[x_s = \frac{x_i - \bar{x}}{s_x}\]
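The scaling formula corresponds to what base R's `scale()` does by default. A minimal sketch with an arbitrary feature vector `x`:

```r
# Arbitrary feature values
x <- c(120, 150, 180, 210, 240)

# Manual scaling: subtract the mean, divide by the standard deviation
x_s <- (x - mean(x)) / sd(x)

# Same result with scale(), which returns a one-column matrix here
scaled <- as.numeric(scale(x))

all.equal(x_s, scaled)
```

After scaling, every feature has a mean of zero and a standard deviation of one, so no single feature dominates the distance calculation.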

Distance calculations

  1. Euclidean distance
  2. Manhattan (taxi cab) distance
  3. Maximum distance: \(\max(\Delta x, \Delta y, \ldots)\)
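All three distance measures are available through the `method` argument of base R's `dist()`. The sketch below uses two illustrative points, (0, 0) and (3, 4), where the differences are 3 and 4.

```r
# Two illustrative points, one per row
pts <- rbind(a = c(0, 0),
             b = c(3, 4))

euclid    <- dist(pts, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
manhattan <- dist(pts, method = "manhattan")  # 3 + 4 = 7
maximum   <- dist(pts, method = "maximum")    # max(3, 4) = 4
```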

Euclidean and Taxi Cab distance

Agglomerative Clustering

  1. Find two closest points
  2. Cluster the two points
  3. Find the next two closest points / clusters
    • Average distance
    • Minimum distance
    • Maximum distance
  4. Cluster the two points / clusters
  5. Continue until all points are in one cluster
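The steps above can be sketched with base R's `hclust()`, which performs agglomerative clustering on a distance matrix. The six-row `customers` data frame is made up for illustration, and the choice of average linkage and two clusters is an assumption, not a recommendation from the course.

```r
# Hypothetical data: two features for six customers
customers <- data.frame(x = c(1, 1.2, 0.8, 5, 5.2, 4.8),
                        y = c(1, 0.9, 1.1, 5, 5.1, 4.9))

# Scale the features, then calculate Euclidean distances
d <- dist(scale(customers), method = "euclidean")

# Agglomerative clustering; "average", "single" (minimum) and
# "complete" (maximum) correspond to the linkage options listed above
hc <- hclust(d, method = "average")

# Cut the tree into two clusters and review the outcome
groups <- cutree(hc, k = 2)
table(groups)
```

Calling `plot(hc)` draws the dendrogram, which shows the order in which points and clusters were merged.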

k-means Clustering

  1. Specify the number of clusters (k)
  2. Randomly select k observations as the initial centroids
  3. Assign each observation to its closest centroid
  4. Update the centroid of each of the k clusters
  5. Repeat steps 3 and 4 to iteratively minimize the total within-cluster sum of squares
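Base R's `kmeans()` follows this procedure. The sketch below uses a made-up `customers` data frame; `k = 2`, the seed, and `nstart = 10` (ten random starts, keeping the best) are illustrative assumptions.

```r
# Hypothetical data: two features for six customers
customers <- data.frame(x = c(1, 1.2, 0.8, 5, 5.2, 4.8),
                        y = c(1, 0.9, 1.1, 5, 5.1, 4.9))

set.seed(42)  # the starting centroids are random, so fix the seed

# Scale first, then cluster with k = 2
km <- kmeans(scale(customers), centers = 2, nstart = 10)

km$cluster       # cluster membership of each observation
km$centers       # final centroids (in scaled units)
km$tot.withinss  # total within-cluster sum of squares
```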

Linear Regression

\[\hat{y} = \beta_0 + \beta_1 x\]

\[SS = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

\[\beta_1 = cor(y,x) \frac{s_y}{s_x}\]

\[\beta_0 = \bar{y} - \beta_1 \bar{x}\]
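The coefficient formulas can be compared with the output of `lm()`. This sketch uses the first pair of variables from R's built-in `anscombe` data set as example data.

```r
# First data set of Anscombe's quartet (built into R)
x <- anscombe$x1
y <- anscombe$y1

# Coefficients from the formulas
beta_1 <- cor(y, x) * sd(y) / sd(x)
beta_0 <- mean(y) - beta_1 * mean(x)

# Same model fitted with lm()
fit <- lm(y ~ x)
coef(fit)

# Sum of squared residuals
ss <- sum((y - (beta_0 + beta_1 * x))^2)
```

Both approaches give an intercept of about 3 and a slope of about 0.5 for this data set.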

Anscombe’s Quartet

Datasaurus

Case Study

Open chapter-11.R script